-
Notifications
You must be signed in to change notification settings - Fork 53
fix(r): Collect array streams in C (not R) before conversion #828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
96bd235 to
5861ab7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors array stream collection to occur in C/C++ rather than R, addressing garbage collection performance issues caused by preserve/protect volume problems when handling large numbers of arrays in R.
Key changes:
- Adds C++ implementation for collecting array stream batches into a vector
- Introduces nanoarrow C++ header files with unique pointer wrappers and utility classes
- Updates R code to use the new C-based collection approach
- Removes the deletion of .hpp files from the bootstrap process
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| r/tools/make-callentries.R | Updates file pattern to include .cc extension |
| r/src/nanoarrow_ipc.hpp | New header defining IPC-related unique pointer wrappers |
| r/src/nanoarrow_cpp.cc | Implements nanoarrow_c_collect_array_stream function for batch collection |
| r/src/nanoarrow.hpp | New comprehensive header with C++ helpers for Arrow structures |
| r/src/init.c | Registers new C function for array stream collection |
| r/bootstrap.R | Removes cleanup of .hpp files |
| r/R/convert-array-stream.R | Refactors to use new C-based collection approach |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
Hi @paleolimbot thank you very much for the work on this. I tried this, and yep this indeed solves the crushingly slow gc() i observed in #822. It is still a little bit slower than what i tried to do in PR #823 though, but that's fine i suppose. My understanding is that we are still restreaming, just in C++ and no longer in R, sounds good. Just curious though, given we are collecting the stream into batches of arrays anyway, why not directly convert the arrays, like what i did in 823? Why the restream? |
This PR moves the proecess of collecting an array stream from R (where we had preserve/protect volume issues that made garbage collection very, very slow) into C/C++.
Doesn't quite solve #822 but it should help!
Reproducer for generating an IPC file with a lot of strings:
Details
...in a separate R session, the issues around taking a long time for the GC to run seemed to go away (but it would be great to have a check!)